Results 1 - 20 of 53

1.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference ; : 2644-2656, 2023.
Article in English | Scopus | ID: covidwho-20243588

ABSTRACT

In automated scientific fact-checking, machine learning models are trained to verify scientific claims given evidence. A major bottleneck of this task is the availability of large-scale training datasets on different domains, due to the required domain expertise for data annotation. However, multiple-choice question-answering datasets are readily available across many different domains, thanks to the modern online education and assessment systems. As one of the first steps towards addressing the fact-checking dataset scarcity problem in scientific domains, we propose a pipeline for automatically converting multiple-choice questions into fact-checking data, which we call Multi2Claim. By applying the proposed pipeline, we generated two large-scale datasets for scientific-fact-checking: Med-Fact and Gsci-Fact for the medical and general science domains, respectively. These two datasets are among the first examples of large-scale scientific-fact-checking datasets. We developed baseline models for the verdict prediction task using each dataset. Additionally, we demonstrated that the datasets could be used to improve performance measured by weighted F1 on existing fact-checking datasets such as SciFact, HEALTHVER, COVID-Fact, and CLIMATE-FEVER. In some cases, the improvement in performance was up to a 26% increase. The generated datasets are publicly available. © 2023 Association for Computational Linguistics.
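For readers unfamiliar with the metric named in this abstract, weighted F1 is the mean of the per-class F1 scores, with each class weighted by its number of true instances. The sketch below is illustrative only: the verdict labels and the toy predictions are assumptions, not data or code from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score


# Hypothetical verdicts for four claims (labels are assumed, not from the paper).
gold = ["SUPPORTS", "SUPPORTS", "REFUTES", "REFUTES"]
pred = ["SUPPORTS", "REFUTES", "REFUTES", "REFUTES"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by support keeps a rare class from dominating the average, which matters when verdict labels are imbalanced.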

2.
International Conference on Enterprise Information Systems, ICEIS - Proceedings ; 1:57-67, 2023.
Article in English | Scopus | ID: covidwho-20239993

ABSTRACT

Companies continuously produce several documents containing valuable information for users. However, querying these documents is challenging, mainly because of the heterogeneity and volume of documents available. In this work, we investigate the challenge of developing a Big Data Question Answering system, i.e., a system that provides a unified, reliable, and accurate way to query documents through naturally asked questions. We define a set of design principles and introduce BigQA, the first software reference architecture to meet these design principles. The architecture consists of high-level layers and is independent of programming language, technology, querying and answering algorithms. BigQA was validated through a pharmaceutical case study managing over 18k documents from Wikipedia articles and FAQ about Coronavirus. The results demonstrated the applicability of BigQA to real-world applications. In addition, we conducted 27 experiments on three open-domain datasets and compared the recall results of the well-established BM25, TF-IDF, and Dense Passage Retriever algorithms to find the most appropriate generic querying algorithm. According to the experiments, BM25 provided the highest overall performance. Copyright © 2023 by SCITEPRESS - Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)
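BM25, the best-performing retriever in these experiments, ranks documents by term frequency, inverse document frequency, and a document-length normalization. The following is a minimal sketch of one common Okapi BM25 variant, not the BigQA implementation; the parameter defaults `k1=1.5` and `b=0.75`, the whitespace tokenizer, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


# Hypothetical mini-corpus, for illustration only.
docs = [
    "coronavirus symptoms include fever and cough",
    "the pharmacy opens at nine",
    "vaccines reduce the risk of severe disease",
]
scores = bm25_scores("coronavirus symptoms", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Recall at k, the measure compared in the abstract, would then be the fraction of questions whose relevant document appears among the k highest-scoring results.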

3.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of System Demonstrations ; : 1-10, 2023.
Article in English | Scopus | ID: covidwho-20232037

ABSTRACT

Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. However, low-resource settings such as new and emerging domains would especially benefit from reliable question answering systems. Furthermore, multilingual and cross-lingual resources in emergent domains are scarce, leading to few or no such systems. In this paper, we demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable. To address the scarcity of cross-lingual training data in emergent domains, we present a method utilizing automatic translation, alignment, and filtering to produce English-to-all datasets. We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting. We illustrate the capabilities of our system with examples and release all code necessary to train and deploy such a system. © 2023 Association for Computational Linguistics.

4.
15th International Conference on Developments in eSystems Engineering, DeSE 2023 ; 2023-January:333-338, 2023.
Article in English | Scopus | ID: covidwho-2324254

ABSTRACT

The COVID-19 crisis has led to an outburst of information that needs to be organized, validated, and made available to those seeking it. Despite the rapid growth and success of BERT models in the last 3 years, COVID QA is a difficult task due to the lack of applicable datasets and a relevant language representation. Therefore, this study proposes a transformer-based Question Answering (QA) model for COVID-19 questions from the biomedical domain. Further, it explores several datasets and models required for question type prediction, no-answer prediction, and answer extraction, along with transfer learning strategies. It has been demonstrated that the exact match score can be significantly improved with limited amounts of training data from the biomedical domain. Finally, the findings of the study have been summarized as the Factoid QA Finetuning Framework (FQFF), which can provide initial direction for domain-specific QA tasks with a limited amount of data. © 2023 IEEE.
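The exact match score mentioned in this abstract is commonly computed by normalizing the predicted and reference answers and testing them for string equality. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the abstract does not state which normalization the study used, and the example answers are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1 if the prediction matches any reference answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))


# Hypothetical answers, for illustration only.
print(exact_match("The ACE2 receptor.", ["ACE2 receptor"]))
print(exact_match("spike protein", ["ACE2 receptor"]))
```

A corpus-level score is then the mean of these 0/1 values over all questions.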

5.
Computacion Y Sistemas ; 26(3):1167-1190, 2022.
Article in English | Web of Science | ID: covidwho-2308030

ABSTRACT

A question answering system that receives as input a question in Spanish and returns the answer is presented. Preguntas y Respuestas {questions and answers} (PryRe) has two main components: 1) An information retrieval component that identifies the meaning of the question using its semantic properties. This component transforms the question into a triplet: R(C, V), where R is the relation or link, C is the concept or main idea, and V is the value of the concept. Example: Cual es la hierba que mejora la digestion? {What is the herb that improves digestion?} becomes R(C, V) = mejora(hierba, digestion) {improves(herb, digestion)}. This component uses natural language processing modules; 2) a component that uses the triplet to carry out a query analysis on PryRe's ontology, to identify the answer, which in the example is Manzanilla {Chamomile}. This component performs the semantic identification of the question while traveling on parts of the ontology. Details of the PryRe system are given, as well as tests on herbalism and Coronavirus. It shows an acceptable accuracy (82%). Resources used in this work are (A) a notation used to describe ontologies, and (B) the deductive capability of PryRe.
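The R(C, V) triplet and its use as a lookup key can be sketched as follows. This is a toy illustration under stated assumptions, not PryRe's implementation: the `Triplet` type, the dictionary standing in for the ontology, and the `answer` function are hypothetical names, and the natural language processing step that turns a Spanish question into a triplet is not shown.

```python
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    """R(C, V): relation, concept, and value of the concept."""
    relation: str
    concept: str
    value: str


# Toy stand-in for the ontology: one fact taken from the abstract's example.
ONTOLOGY = {
    Triplet("mejora", "hierba", "digestion"): "Manzanilla",
}


def answer(triplet: Triplet) -> Optional[str]:
    """Look up the answer for a triplet; None if the ontology has no match."""
    return ONTOLOGY.get(triplet)


# mejora(hierba, digestion) {improves(herb, digestion)}
question = Triplet(relation="mejora", concept="hierba", value="digestion")
print(answer(question))
```

A flat dictionary only covers exact matches; the abstract describes traversal over parts of an ontology, which would also allow answers to be deduced from related concepts.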

6.
International Journal of Advanced Computer Science and Applications ; 13(12):277-285, 2022.
Article in English | Web of Science | ID: covidwho-2310517

ABSTRACT

COVID-19 has been a prominent issue from 2019 until today. Recently, much research has been conducted to make use of the large amount of data about COVID-19. In this work, we conduct a closed domain question answering (CDQA) task on COVID-19 using a transfer learning technique. The transfer learning technique is adopted because a large benchmark for question answering about COVID-19 is still unavailable. Therefore, rich knowledge learned from a large benchmark of open domain QA is utilized through transfer learning to improve the performance of our CDQA system. We use a retriever-reader framework for our CDQA system, and propose to use the Sequential Dependence Model (SDM) as our retriever component to enhance the effectiveness of the system. Our results show that the use of the SDM retriever can improve the F-1 score of the state-of-the-art baseline CDQA system using the BM25 and TF-IDF+cosine similarity retrievers by 3.26% and 32.62%, respectively. The optimal parameter settings for our CDQA system are found to be as follows: using 20 top-ranked documents as the retriever's output, five sentences as the passage length, and the BERT-Large-Uncased model as the reader. In this optimal parameter setting, the SDM retriever can improve the F-1 score of the state-of-the-art baseline CDQA system using BM25 by 5.06% and using the TF-IDF+cosine similarity retriever by 24.94%. Our last experiment then confirms the merit of using transfer learning, since our best-performing model (double fine-tuned on SQuAD and COVID-QA) is shown to gain eight times higher accuracy than the baseline method without transfer learning. Further fine-tuning the transfer learning model using a closed domain dataset (COVID-QA) can increase the accuracy of the transfer learning model that is only fine-tuned with SQuAD by 27.26%.

7.
J Biomed Inform ; 142: 104382, 2023 06.
Article in English | MEDLINE | ID: covidwho-2307390

ABSTRACT

The article presents a workflow to create a question-answering system whose knowledge base combines knowledge graphs and scientific publications on coronaviruses. It is based on the experience gained in modeling evidence from research articles to provide answers to questions in natural language. The work contains best practices for acquiring scientific publications, tuning language models to identify and normalize relevant entities, creating representational models based on probabilistic topics, and formalizing an ontology that describes the associations between domain concepts supported by the scientific literature. All the resources generated in the domain of coronavirus are available openly as part of the Drugs4COVID initiative, and can be (re)-used independently or as a whole. They can be exploited by scientific communities conducting research related to SARS-CoV-2/COVID-19 and also by therapeutic communities, laboratories, etc., wishing to find and understand relationships between symptoms, drugs, active ingredients and their documentary evidence.


Subject(s)
COVID-19; Humans; SARS-CoV-2; Pattern Recognition, Automated; Publications
8.
7th Arabic Natural Language Processing Workshop, WANLP 2022 held with EMNLP 2022 ; : 1-10, 2022.
Article in English | Scopus | ID: covidwho-2290872

ABSTRACT

Named Entity Recognition (NER) is a well-known problem for the natural language processing (NLP) community. It is a key component of different NLP applications, including information extraction, question answering, and information retrieval. In the literature, there are several Arabic NER datasets with different named entity tags; however, due to data and concept drift, we are always in need of new data for NER and other NLP applications. In this paper, first, we introduce Wassem, a web-based annotation platform for Arabic NLP applications. Wassem can be used to manually annotate textual data for a variety of NLP tasks: text classification, sequence classification, and word segmentation. Second, we introduce the COVID-19 Arabic Named Entities Recognition (CAraNER) dataset extracted from the Arabic Newspaper COVID-19 Corpus (AraNPCC). CAraNER has 55,389 tokens distributed over 1,278 sentences randomly extracted from Saudi Arabian newspaper articles published during 2019, 2020, and 2021. The dataset is labeled by five annotators with five named-entity tags, namely: Person, Title, Location, Organization, and Miscellaneous. The CAraNER corpus is available for download for free. We evaluate the corpus by finetuning four BERT-based Arabic language models on the CAraNER corpus. The best model was AraBERTv0.2-large with 0.86 for the F1 macro measure. © 2022 Association for Computational Linguistics.

9.
8th China Conference on China Health Information Processing, CHIP 2022 ; 1772 CCIS:156-169, 2023.
Article in English | Scopus | ID: covidwho-2277218

ABSTRACT

Question Answering based on Knowledge Graph (KG) has emerged as a popular research area in the general domain. However, few works focus on COVID-19 KG-based question answering, which is very valuable for the biomedical domain. In addition, existing question answering methods rely on knowledge embedding models to represent knowledge (i.e., entities and questions), but the relations between entities are neglected. In this paper, we construct a COVID-19 knowledge graph and propose an end-to-end knowledge graph question answering approach that can utilize relation information to improve performance. Experimental results show the effectiveness of our approach on COVID-19 knowledge graph question answering. Our code and data are available at https://github.com/CHNcreater/COVID-19-KGQA. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

10.
1st Workshop on NLP for COVID-19 at the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 ; 2020.
Article in English | Scopus | ID: covidwho-2272652

ABSTRACT

We present COVID-QA, a Question Answering dataset consisting of 2,019 question/answer pairs annotated by volunteer biomedical experts on scientific articles related to COVID-19. To evaluate the dataset we compared a RoBERTa base model fine-tuned on SQuAD with the same model trained on SQuAD and our COVID-QA dataset. We found that the additional training on this domain-specific data leads to significant gains in performance. Both the trained model and the annotated dataset have been open-sourced at: https://github.com/deepset-ai/COVID-QA. © ACL 2020. All rights reserved.

11.
2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 ; : 148-158, 2022.
Article in English | Scopus | ID: covidwho-2287144

ABSTRACT

The medical conversational system can relieve doctors' burden and improve healthcare efficiency, especially during the COVID-19 pandemic. However, the existing medical dialogue systems have the problems of weak scalability, insufficient knowledge, and poor controllability. Thus, we propose a medical conversational question-answering (CQA) system based on the knowledge graph, namely MedConQA, which is designed as a pipeline framework to maintain high flexibility. Our system utilizes automated medical procedures, including medical triage, consultation, image-text drug recommendation, and record. Each module has been open-sourced as a tool, which can be used alone or in combination, with robust scalability. Besides, to conduct knowledge-grounded dialogues with users, we first construct a Chinese Medical Knowledge Graph (CMKG) and collect a large-scale Chinese Medical CQA (CMCQA) dataset, and we design a series of methods for reasoning more intellectually. Finally, we use several state-of-the-art (SOTA) techniques to keep the final generated response more controllable, which is further assured by hospital and professional evaluations. We have open-sourced related code, datasets, web pages, and tools, hoping to advance future research. © 2022 Association for Computational Linguistics.

12.
60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 ; 2022.
Article in English | Scopus | ID: covidwho-2247162

ABSTRACT

The proceedings contain 27 papers. The topics discussed include: UKP-SQUARE: an online platform for question answering research; ViLMedic: a framework for research at the intersection of vision and language in medical AI; TextPruner: a model pruning toolkit for pre-trained language models; AnnIE: an annotation platform for constructing complete open information extraction benchmark; AdapterHub playground: simple and flexible few-shot learning with adapters; QiuNiu: a Chinese lyrics generation system with passage-level input; automatic gloss dictionary for sign language learners; PromptSource: an integrated development environment and repository for natural language prompts; COVID-19 claim radar: a structured claim extraction and tracking system; TS-Anno: an annotation tool to build, annotate and evaluate text simplification corpora; and CogKGE: a knowledge graph embedding toolkit and benchmark for representing multi-source and heterogeneous knowledge.

13.
2022 IEEE International Conference on Big Data, Big Data 2022 ; : 2364-2369, 2022.
Article in English | Scopus | ID: covidwho-2280012

ABSTRACT

Recent advances in the healthcare industry have led to an abundance of unstructured data, making it challenging to perform tasks such as efficient and accurate information retrieval at scale. Our work offers an all-in-one scalable solution for extracting and exploring complex information from large-scale research documents, which would otherwise be tedious. First, we briefly explain our knowledge synthesis process to extract helpful information from unstructured text data of research documents. Then, on top of the knowledge extracted from the documents, we perform complex information retrieval using three major components: Paragraph Retrieval, Triplet Retrieval from Knowledge Graphs, and Complex Question Answering (QA). These components combine lexical and semantic-based methods to retrieve paragraphs and triplets and perform faceted refinement for filtering these search results. The complexity of biomedical queries and documents necessitates using a QA system capable of handling queries more complex than factoid queries, which we evaluate qualitatively on the COVID-19 Open Research Dataset (CORD-19) to demonstrate the effectiveness and value-add. © 2022 IEEE.

14.
45th European Conference on Information Retrieval, ECIR 2023 ; 13982 LNCS:557-567, 2023.
Article in English | Scopus | ID: covidwho-2263971

ABSTRACT

In this paper, we provide an overview of the upcoming ImageCLEF campaign. ImageCLEF has been part of the CLEF Conference and Labs of the Evaluation Forum since 2003. ImageCLEF, the Multimedia Retrieval task in CLEF, is an ongoing evaluation initiative that promotes the evaluation of technologies for annotation, indexing, and retrieval of multimodal data with the aim of providing information access to large collections of data in various usage scenarios and domains. In its 21st edition, ImageCLEF 2023 will have four main tasks: (i) a Medical task addressing automatic image captioning, synthetic medical images created with GANs, Visual Question Answering for colonoscopy images, and medical dialogue summarization; (ii) an Aware task addressing the prediction of real-life consequences of online photo sharing; (iii) a Fusion task addressing late fusion techniques based on the expertise of a pool of classifiers; and (iv) a Recommending task addressing cultural heritage content-recommendation. In 2022, ImageCLEF received the participation of over 25 groups submitting more than 258 runs. These numbers show the impact of the campaign. With the COVID-19 pandemic now over, we expect that the interest in participating, especially at the physical CLEF sessions, will increase significantly in 2023. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

15.
Expert Systems with Applications ; 223, 2023.
Article in English | Scopus | ID: covidwho-2263399

ABSTRACT

Because of the frequent occurrence of chronic diseases, the COVID-19 pandemic, etc., online health expert question-answering (HQA) services have been unable to cope with the rapidly increasing demand for online consultations. Building a virtual health assistant based on medical named entity recognition (NER) can effectively assist with the consultation process, but the unstandardized expressions within HQA text pose a serious challenge for medical NER tasks. The main goal of this study is to propose a novel deep medical NER approach based on a collaborative decision strategy (CDS), i.e., co_decision_NER (CDN), that can identify standard and nonstandard medical entities in the HQA context. We collected 10,000 question–answer pairs from HaoDF, extracted medical entities from 15 entity categories, and used a CDS to fuse the advantages of different NER models. Ultimately, CDN achieved a performance (precision = 84.50%, recall = 84.30%, F1 = 84.40%) that was significantly better than that of the state-of-the-art (SOTA) method. Our empirical analysis suggests that the entity types Disease (DIS), Sign (SIG), Test (TES), Drug (DRU), Surgery (SUR), Precaution (PRE), and Region (REG) can be most easily expressed arbitrarily in the doctor–patient interaction scenario of HQA services. In addition, CDN can identify not only standard but also nonstandard medical entities, effectively alleviating the severe out-of-vocabulary (OOV) problem faced by HQA services when performing medical NER tasks. The core contribution of this study is the development of a novel neural network model fusion algorithm that can improve the performance of entity recognition in medical domain-specific tasks. © 2023 Elsevier Ltd

16.
Procedia Comput Sci ; 219: 388-396, 2023.
Article in English | MEDLINE | ID: covidwho-2257362

ABSTRACT

The paper discusses the design and implementation process of an intelligent system for answering specialized questions about COVID-19. The system is based on deep learning and transfer learning techniques and uses the popular CORD-19 dataset as a source of scientific knowledge about the problem domain. The experiments performed with the pilot version of the system are presented and the obtained results are analyzed. Conclusions are formulated about the applicability and the opportunities for improvement of the proposed approach.

17.
Advanced Data Mining and Applications (Adma 2022), Pt I ; 13725:259-274, 2022.
Article in English | Web of Science | ID: covidwho-2236377

ABSTRACT

Question answering over knowledge bases (KBQA) has become a popular approach to help users extract information from knowledge bases. Although several systems exist, choosing one suitable for a particular application scenario is difficult. In this article, we provide a comparative study of six representative KBQA systems on eight benchmark datasets. In that, we study various question types, properties, languages, and domains to provide insights on where existing systems struggle. On top of that, we propose an advanced mapping algorithm to aid existing models in achieving superior results. Moreover, we also develop a multilingual corpus COVID-KGQA, which encourages COVID-19 research and multilingualism for the diversity of future AI. Finally, we discuss the key findings and their implications as well as performance guidelines and some future improvements. Our source code is available at https://github.com/tamlhp/kbqa.

18.
Information Discovery and Delivery ; 2023.
Article in English | Scopus | ID: covidwho-2233762

ABSTRACT

Purpose: This study aims to evaluate a method of building a biomedical knowledge graph (KG). Design/methodology/approach: This research first constructs a COVID-19 KG on the COVID-19 Open Research Data Set, covering information over six categories (i.e. disease, drug, gene, species, therapy and symptom). The construction used open-source tools to extract entities, relations and triples. Then, the COVID-19 KG is evaluated on three data-quality dimensions: correctness, relatedness and comprehensiveness, using a semiautomatic approach. Finally, this study assesses the application of the KG by building a question answering (Q&A) system. Five queries regarding COVID-19 genomes, symptoms, transmissions and therapeutics were submitted to the system and the results were analyzed. Findings: With current extraction tools, the quality of the KG is moderate and difficult to improve, unless more efforts are made to improve the tools for entity extraction, relation extraction and others. This study finds that comprehensiveness and relatedness positively correlate with the data size. Furthermore, the results indicate the performances of the Q&A systems built on the larger-scale KGs are better than the smaller ones for most queries, proving the importance of relatedness and comprehensiveness to ensure the usefulness of the KG. Originality/value: The KG construction process, data-quality-based and application-based evaluations discussed in this paper provide valuable references for KG researchers and practitioners to build high-quality domain-specific knowledge discovery systems. © 2022, Emerald Publishing Limited.

19.
J Am Med Inform Assoc ; 2022 Nov 17.
Article in English | MEDLINE | ID: covidwho-2236921

ABSTRACT

OBJECTIVE: The rapidly growing body of communications during the COVID-19 pandemic posed a challenge to information seekers, who struggled to find answers to their specific and changing information needs. We designed a Question Answering (QA) system capable of answering ad-hoc questions about the COVID-19 disease, its causal virus SARS-CoV-2, and the recommended response to the pandemic. MATERIALS AND METHODS: The QA system incorporates, in addition to relevance models, automatic generation of questions from relevant sentences. We relied on entailment between questions for (1) pinpointing answers and (2) selecting novel answers early in the list of its results. RESULTS: The QA system produced state-of-the-art results when processing questions asked by experts (eg, researchers, scientists, or clinicians) and competitive results when processing questions asked by consumers of health information. Although state-of-the-art models for question generation and question entailment were used, more than half of the answers were missed, due to the limitations of the relevance models employed. DISCUSSION: Although question entailment enabled by automatic question generation is the cornerstone of our QA system's architecture, question entailment did not prove to always be reliable or sufficient in ranking the answers. Question entailment should be enhanced with additional inferential capabilities. CONCLUSION: The QA system presented in this article produced state-of-the-art results processing expert questions and competitive results processing consumer questions. Improvements should be considered by using better relevance models and enhanced inference methods. Moreover, experts and consumers have different answer expectations, which should be accounted for in future QA development.

20.
International Journal of Advanced Computer Science and Applications ; 13(12), 2022.
Article in English | ProQuest Central | ID: covidwho-2226285

ABSTRACT

COVID-19 has been a prominent issue from 2019 until today. Recently, much research has been conducted to make use of the large amount of data about COVID-19. In this work, we conduct a closed domain question answering (CDQA) task on COVID-19 using a transfer learning technique. The transfer learning technique is adopted because a large benchmark for question answering about COVID-19 is still unavailable. Therefore, rich knowledge learned from a large benchmark of open domain QA is utilized through transfer learning to improve the performance of our CDQA system. We use a retriever-reader framework for our CDQA system, and propose to use the Sequential Dependence Model (SDM) as our retriever component to enhance the effectiveness of the system. Our results show that the use of the SDM retriever can improve the F-1 score of the state-of-the-art baseline CDQA system using the BM25 and TF-IDF+cosine similarity retrievers by 3.26% and 32.62%, respectively. The optimal parameter settings for our CDQA system are found to be as follows: using 20 top-ranked documents as the retriever's output, five sentences as the passage length, and the BERT-Large-Uncased model as the reader. In this optimal parameter setting, the SDM retriever can improve the F-1 score of the state-of-the-art baseline CDQA system using BM25 by 5.06% and using the TF-IDF+cosine similarity retriever by 24.94%. Our last experiment then confirms the merit of using transfer learning, since our best-performing model (double fine-tuned on SQuAD and COVID-QA) is shown to gain eight times higher accuracy than the baseline method without transfer learning. Further fine-tuning the transfer learning model using a closed domain dataset (COVID-QA) can increase the accuracy of the transfer learning model that is only fine-tuned with SQuAD by 27.26%.
